ABSTRACT


Modeling student learning processes is highly complex since 
it is influenced by many factors such as motivation and 
learning habits. The high volume of features and tools pro- 
vided by computer-based learning environments confounds 
the task of tracking student knowledge even further. Deep
learning models such as Long Short-Term Memory (LSTM)
and classic Markovian models such as Bayesian Knowledge
Tracing (BKT) have been successfully applied to student
modeling. However, much of this prior work is designed to
handle sequences of events with discrete timesteps, rather
than considering the continuous aspect of time. Given that
the time elapsed between successive elements in a student’s
trajectory can vary from seconds to days, we applied a
Time-aware LSTM (T-LSTM) to model the dynamics of student
knowledge state in continuous time. We investigate the
effectiveness of T-LSTM on two domains with very different
characteristics. The first is an open-ended programming
environment where students self-pace their progress; there,
T-LSTM is compared against LSTM, Recent Temporal Pattern
Mining (RTP), and classic Logistic Regression (LR) on the
early prediction of student success. The second is a classic
tutor-driven intelligent tutoring system (ITS) where the
tutor scaffolds student learning step by step; there, T-LSTM
is compared with LSTM, LR, and BKT on the early prediction
of student learning gains. Our results show that T-LSTM
significantly outperforms the other methods in the
self-paced, open-ended programming environment, while on
the tutor-driven ITS it ties with LSTM and outperforms
both LR and BKT. In other words, although time irregularity
exists in both datasets, T-LSTM works significantly better
than the other student models when the pace is driven by
students. When the irregularity results from the tutor,
T-LSTM is not superior to the other models, but its
performance is not hurt either.
1. INTRODUCTION


Student modeling sits at the center of educational data
mining. It monitors a student’s progress, ability, or
knowledge over a set of skills and can predict the student’s
future performance based on historical sequence data. In
recent years, recurrent neural network architectures such as
Long Short-Term Memory (LSTM) have become the workhorses
for modeling sequential data in a variety of tasks, such as
video processing, climate change detection, and patient
disease progression prediction [20, 19, 25, 12]. Deep
Knowledge Tracing [35, DKT], the first LSTM approach in
student modeling, reported an impressive improvement over
a classical statistical model, Bayesian Knowledge Tracing
[10, BKT]. Both LSTM/DKT and BKT are designed to handle
sequences of events with discrete timesteps and do not
consider the continuous aspect of time.


On the other hand, student response time, i.e., the elapsed
time between consecutive elements of a sequence, can vary
greatly by student, from seconds to days. Ever since the
mid-1950s, response time has been used in cognitive
psychology as a preferred educational assessment of how
active and accessible student knowledge is [43]. For example,
it has been shown that response time reveals student
proficiency [40] and that there is a significant negative
correlation between a student’s average response time and
the final exam score at the end of the semester [16].
Additionally, response time has been suggested as an
indicator of student engagement in answering questions [21]
as well as an important factor for predicting motivation in
learning environments [9]. BKT prediction performance can
also be improved by leveraging time information [38, 44].
Therefore, by not taking time intervals into consideration,
the design of traditional LSTM and BKT may lead to
sub-optimal performance in modeling student learning.


Previous work on modeling sequence data has explored
several ways to handle time irregularity [3, 34, 8, 6];
among them, Time-aware LSTM (T-LSTM) is one of the
state-of-the-art models [3]. T-LSTM transforms the time
intervals between successive elements into weights and uses
them to adjust the memory passed from previous moments. In
this work, we apply T-LSTM to model the dynamics of student
knowledge state in continuous time and conduct two
empirical comparisons between T-LSTM and the standard LSTM,
Recent Temporal Pattern Mining [23], and classical student
modeling methods such as BKT and logistic regression on
two real-world datasets collected from two learning
environments with very different characteristics. One is an
open-ended block-based programming environment for a novice
programming task, where students are free to explore the
environment with minimal system support or constraints.
Each student’s log file is a trajectory of actions with
corresponding time stamps, and time intervals are calculated
between consecutive student actions. The other, a
probability tutor, is tutor-driven in that the tutor decides
what to do next. Each student’s log file is a trajectory of
student-ITS interactions. In each interaction, the tutor
first elicits the subsequent step from the student with a
prompt; when the student performs a step, the tutor records
its success or failure and may give feedback (e.g.,
correct/incorrect markings). If the student’s answer is
incorrect, the tutor provides a series of hints from general
to specific, and the bottom-out hint tells the student
exactly what to do. An interaction ends only when the step
is answered correctly, after which the tutor moves to the
next interaction. As a result, each student’s log file is a
trajectory of tutor actions mixed with the student’s
responses, with corresponding time stamps. In this
environment, the time intervals are calculated between
the student’s first attempt on one problem and the next.
Our research question is: By taking time-awareness into
consideration, would T-LSTM outperform other traditional
student modeling methods on both self-paced and
tutor-driven learning environments?


2. METHODS 


2.1 Long Short-Term Memory 

Long Short-Term Memory [18, LSTM] is a special type of
RNN that is explicitly designed to avoid the long-term
dependency problem. LSTM mitigates the vanishing (and
exploding) gradient problem and performs well on a large
variety of problems.


Figure 1: The Structure of an LSTM Unit


The internal structure of each LSTM module is shown in
Figure 1. A standard LSTM unit has three major components:
a forget gate, an input gate, and an output gate, which
interact with each other to control how information flows.
In the first step, a function of the previous hidden state
$h_{t-1}$ and the new input $x_t$ passes through the forget
gate, indicating what is probably irrelevant and can be
taken out of the cell state. The forget component calculates
a weight $f_t$ between 0 and 1 for each element of the cell
state vector $C_{t-1}$. An element with a weight of 0 is
completely forgotten, whereas an element with a weight of 1
is entirely remembered. The formula for $f_t$ is shown
below, where $W_f$ and $b_f$ are the weights and intercepts,
respectively, of the forget component.


$f_t = \mathrm{sigmoid}(W_f \cdot [h_{t-1}, x_t] + b_f)$    (1)


The input component involves two steps. First, a tanh layer
calculates a candidate vector $\tilde{C}_t$ that could be
added to the current cell state. Second, the input component
calculates a weight vector $i_t$ (ranging from 0 to 1) that
determines to what extent $\tilde{C}_t$ should update the
current memory state.


$\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$    (2)


$i_t = \mathrm{sigmoid}(W_i \cdot [h_{t-1}, x_t] + b_i)$    (3)


With the forget and input components, the module is able
to throw away expired information in the previous cell
state by calculating $C_{t-1} \cdot f_t$, and to process new
information by computing $\tilde{C}_t \cdot i_t$.
Consequently, the current memory cell is updated as shown
below. Note that the current memory cell state $C_t$ is then
passed to the next LSTM module.


$C_t = C_{t-1} \cdot f_t + \tilde{C}_t \cdot i_t$    (4)


Finally, the output component is an activation function
that filters the elements of $C_t$. The tanh function
converts $C_t$ to values between -1 and 1. The output
component calculates a weight vector


$o_t = \mathrm{sigmoid}(W_o \cdot [h_{t-1}, x_t] + b_o)$    (5)


that determines how much information is allowed to be
revealed.


$h_t = o_t * \tanh(C_t)$    (6)


With such a gated structure, LSTM is capable of handling 
long-term dependencies. 
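
The gate computations in Equations (1)-(6) can be sketched as a single forward step. This is a minimal NumPy illustration and not the implementation used in the experiments; the function names `lstm_step` and `init_params`, the layout of the parameter dictionary, and the toy initialization are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM forward step following Equations (1)-(6).

    `params` is a hypothetical dict holding W_f, W_c, W_i, W_o of shape
    (hidden, hidden + input) and b_f, b_c, b_i, b_o of shape (hidden,).
    """
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # Eq. (1): forget gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # Eq. (2): candidate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # Eq. (3): input gate
    c_t = c_prev * f_t + c_tilde * i_t                         # Eq. (4): cell update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # Eq. (5): output gate
    h_t = o_t * np.tanh(c_t)                                   # Eq. (6): new hidden state
    return h_t, c_t


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases, for illustration only."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fcio":
        shape = (hidden_size, hidden_size + input_size)
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, shape)
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params
```

Calling `lstm_step` repeatedly over a trajectory, feeding each returned `(h_t, c_t)` into the next call, reproduces the chaining of modules described above.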


2.2 Time-Aware Long Short-Term Memory
The standard LSTM assumes that the elapsed time between
consecutive elements of a sequence is uniform, and it is
therefore designed to handle sequences with discrete
timesteps. However, in the educational domain, the interval
between two consecutive steps in a student trajectory
can span from seconds to days. In general, events that
occurred long ago tend to have less impact on the current
state, and thus their contributions should be reduced
accordingly. Therefore, it is important to consider the
elapsed time when predicting the current event’s output. In
this work, we applied Time-aware LSTM [3, T-LSTM], which
was proposed to handle the temporal dynamics of sequential
data with time irregularities, to model student knowledge
states in continuous time.


Figure 2: The Structure of a T-LSTM Unit


The T-LSTM architecture is shown in Figure 2. To fit our
domain, the input sequences are the student trajectories.
Apart from the three gates of the standard LSTM (forget,
input, and output), T-LSTM also integrates the time elapsed
between successive records into the network architecture
through what we call the time decay component. The
information stored in the previous cell state $C_{t-1}$ is
decomposed into two parts: long-term memory and short-term
memory. Without losing the long-term memory contained in
$C_{t-1}$, the time decay component adjusts the short-term
memory using the elapsed time between successive steps. If
the gap between two steps is very large, e.g., a few hours
in our domain, there has been a long time with no
interaction between the student and the tutor/computer. In
that case, there is little point in relying heavily on the
previous short-term memory to predict the current output.
In the T-LSTM framework, a non-increasing function of the
elapsed time transforms the time duration into an
appropriate weight. In this work, we applied
$g(\Delta t) = 1/\log(e + \Delta t)$ to obtain the
corresponding weights.
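
As a rough illustration of how this decay function behaves, the sketch below evaluates the weight for a few elapsed times. It assumes that the elapsed time is measured in seconds and that log denotes the natural logarithm; the helper name `decay_weight` is ours and not part of any library.

```python
import math


def decay_weight(delta_t):
    """g(dt) = 1 / log(e + dt): a non-increasing weight for elapsed time dt >= 0."""
    if delta_t < 0:
        raise ValueError("elapsed time must be non-negative")
    return 1.0 / math.log(math.e + delta_t)


# Elapsed times in seconds (assumed unit): 0 s, 1 s, 1 minute, 1 hour, 1 day.
for dt in (0, 1, 60, 3600, 86400):
    print(f"dt = {dt:>6d} s -> g = {decay_weight(dt):.3f}")
```

A gap of zero leaves the short-term memory untouched (weight 1), while longer gaps shrink it only slowly because of the logarithm, so even a gap of a day does not erase the short-term memory entirely.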


The time decay component of T-LSTM involves the following
calculations. First, the short-term memory $C^S_{t-1}$ is
calculated.


$C^S_{t-1} = \tanh(W_d \cdot C_{t-1} + b_d)$    (7)


The long-term memory is obtained by subtracting the
short-term memory from the previous cell state.


$C^L_{t-1} = C_{t-1} - C^S_{t-1}$    (8)


Then $C^S_{t-1}$ is discounted by the elapsed-time weight to
obtain the discounted short-term memory $\hat{C}^S_{t-1}$.


$\hat{C}^S_{t-1} = C^S_{t-1} * g(\Delta t)$    (9)


Finally, the adjusted previous cell state $C^*_{t-1}$ is
composed by adding the long-term memory and the discounted
short-term memory.


$C^*_{t-1} = C^L_{t-1} + \hat{C}^S_{t-1}$    (10)


The remaining steps are very similar to the standard LSTM.
Following Section 2.1, we first calculate the forget gate
$f_t$, candidate vector $\tilde{C}_t$, and input gate $i_t$
by applying Equations (1), (2), and (3). To calculate the
current memory cell state $C_t$, the T-LSTM framework uses
the adjusted previous cell state $C^*_{t-1}$ instead of
$C_{t-1}$.


$C_t = C^*_{t-1} \cdot f_t + \tilde{C}_t \cdot i_t$    (11)


The final output for the current state is then obtained
using Equations (5) and (6), as in the standard LSTM. In
this work, we investigate the effectiveness of T-LSTM via
the early prediction of both student success and learning
gains. As far as we know, no prior studies have explored
T-LSTM on both computer-based programming systems and
intelligent tutoring systems.
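
The time decay component in Equations (7)-(11) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the decay uses the natural logarithm, and the gate values `f_t`, `c_tilde`, and `i_t` are assumed to have been computed beforehand as in Equations (1)-(3).

```python
import numpy as np


def adjust_previous_memory(c_prev, delta_t, W_d, b_d):
    """Time decay component of T-LSTM, Equations (7)-(10)."""
    c_short = np.tanh(W_d @ c_prev + b_d)   # Eq. (7): short-term memory
    c_long = c_prev - c_short               # Eq. (8): long-term memory
    g = 1.0 / np.log(np.e + delta_t)        # decay weight g(dt), natural log assumed
    c_short_hat = c_short * g               # Eq. (9): discounted short-term memory
    return c_long + c_short_hat             # Eq. (10): adjusted previous memory


def tlstm_cell_update(c_prev, delta_t, f_t, c_tilde, i_t, W_d, b_d):
    """Eq. (11): cell update that uses the adjusted memory instead of c_prev."""
    c_star = adjust_previous_memory(c_prev, delta_t, W_d, b_d)
    return c_star * f_t + c_tilde * i_t
```

When the elapsed time is zero the weight is 1, so the adjusted memory equals the previous memory and the update reduces to Equation (4).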


2.3 Recent Temporal Pattern Mining 

The Recent Temporal Pattern mining (RTP) framework [2] 
was originally proposed to find predictive patterns from com- 
plex multivariate time series data. This framework first con- 
verts time series into time-interval sequences of temporal 
abstractions, and then constructs more complex temporal 
patterns backwards. The following part will explain how 
the RTP framework is applied in our work. 


Multivariate State Sequences: We denote a State S as
(F, V), where F is a temporal feature and V is the value
of feature F at a given time, and a State Interval E as
(F, V, s, e), where s and e refer to the start and end
times of the state (F, V). Thus, we can convert each
student’s data $x_i$ into a corresponding Multivariate
State Sequence (MSS) $z_i$ by sorting all the state
intervals by their start times:
$z_i = \langle E_1, E_2, \ldots, E_n \rangle : E_j.s \le E_{j+1}.s,\ j \in \{1, \ldots, n-1\}$.
We apply two temporal relations in this work:
1) $E_i$ before (b) $E_j$: when $E_i$ ends before the start
of $E_j$ ($E_i.e < E_j.s$); 2) $E_i$ co-occurs (c) with
$E_j$: when $E_i$ and $E_j$ have some overlap
($E_i.s \le E_j.s \le E_i.e$).
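
A small sketch may make the two relations concrete. The `StateInterval` type, the function names, and the feature names used in the usage note are illustrative assumptions, not part of the RTP framework's published code.

```python
from typing import NamedTuple


class StateInterval(NamedTuple):
    feature: str   # temporal feature F
    value: str     # value V
    start: float   # start time s
    end: float     # end time e


def to_mss(intervals):
    """Sort state intervals by start time to form a multivariate state sequence."""
    return sorted(intervals, key=lambda e: e.start)


def relation(e_i, e_j):
    """Return 'b' (before) or 'c' (co-occurs) for e_i relative to e_j.

    Assumes e_i starts no later than e_j, as in a sorted MSS; returns None
    when that assumption is violated.
    """
    if e_i.end < e_j.start:
        return "b"
    if e_i.start <= e_j.start <= e_i.end:
        return "c"
    return None
```

For instance, with hypothetical intervals `StateInterval("f1", "1", 0, 10)` and `StateInterval("f2", "1", 5, 20)`, the first co-occurs with the second because the second starts while the first is still active.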


Recent Temporal Patterns: We call a state interval
E = (F, V, s, e) a Recent State Interval of MSS $z_i$ if:
1) E is the last state interval for feature F; that is, for
all $E' = (F, V', s', e')$ in $z_i$, we have
$E'.e \le E.e$; or 2) E is less than g time units away from
the end time of the last state interval, $z_i.end$; that
is, $z_i.end - E.e < g$.
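
The two conditions can be checked directly, as in the sketch below. It represents each state interval as a plain `(feature, value, start, end)` tuple and takes the end of the sequence to be the largest end time among its intervals; both choices, and the function name `is_recent`, are assumptions made for illustration.

```python
def is_recent(interval, mss, g):
    """Check whether `interval` is a recent state interval of `mss`.

    Intervals are (feature, value, start, end) tuples; `g` is the maximum gap.
    """
    feature, _value, _start, end = interval
    # Condition 1: no interval of the same feature ends later than this one.
    last_for_feature = all(e[3] <= end for e in mss if e[0] == feature)
    # Condition 2: the interval ends less than g time units before the MSS ends.
    z_end = max(e[3] for e in mss)
    return last_for_feature or (z_end - end < g)
```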


Given an MSS $z_i$, a temporal pattern
$P = (\langle S_1, \ldots, S_k \rangle, R)$, and a maximum
gap parameter g, we say P is a recent temporal pattern
(RTP) in $z_i$, denoted $R_g(P, z_i)$, if all three of the
following conditions hold: 1) $z_i$ contains P, where
$P \in z_i$ if (a) $z_i$ contains all k states of P, and
(b) all temporal relations of P are satisfied in $z_i$;
2) $S_k = (F_k, V_k)$ matches a recent state interval in
$z_i$; and 3) every consecutive pair of states in P maps to
state intervals less than g time units apart. That is, no
consecutive pair may be g or more time units apart. In
short, the parameter g forces patterns to be close to the
end of the sequence $z_i$ and forces consecutive states to
be close to each other.


Mining Algorithm: Taking student success classification
as an example, we have two sets of labeled MSSs:
$Z_1 = \{z_i : y_i = 1\}$ for all unsuccessful sequences and
$Z_0 = \{z_i : y_i = 0\}$ for all successful ones. Given
$Z_1$, the mining algorithm applies a level-wise search to
find frequent RTPs. More specifically, it starts with all
frequent 1-RTPs and then extends the patterns by adding one
new state at a time, until no new patterns are discovered.
That is, at each level k, the algorithm finds frequent
(k+1)-RTPs by repeatedly extending k-RTPs through the
Backward candidate generation phase and the Counting phase,
as described below.


Backward (k+1)-pattern candidates are generated from a
k-pattern $P = (\langle S_1, \ldots, S_k \rangle, R)$ by
adding a new frequent state, $S_{new}$, to the beginning of
the sequence to create
$P' = (\langle S_{new}, S_1, \ldots, S_k \rangle, R')$. Then
we specify the new before (b) or co-occurs (c) relations
$R'$ between $S_{new}$ and all original k states, restricted
by the following two criteria: 1) Two state intervals of the
same temporal feature cannot co-occur. That is, if
$S_{new}.F = S_i.F$ for $i \in \{1, \ldots, k\}$, then
$R'_{new,i} \ne c$. 2) Since the state sequence in pattern P
is sorted by the start time of the states, once a relation
becomes before, $R'_{new,i} = b$ for any
$i \in \{1, \ldots, k\}$, all of the following relations
have to be before, so $R'_{new,j} = b$ for
$j \in \{i+1, \ldots, k\}$.


Figure 3: An example of generating 3-patterns out
of a single 2-RTP, by appending a new state.
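
Taken together, the two criteria leave only a handful of valid relation vectors for each new state: a run of co-occurs relations followed by a run of before relations, where the co-occurs run cannot reach a state with the same feature. The sketch below enumerates them; the function name and the use of plain feature-name lists are assumptions for illustration.

```python
def candidate_relations(new_feature, pattern_features):
    """Enumerate valid relation vectors between S_new and the k states of a pattern.

    Each vector lists 'c' (co-occurs) or 'b' (before) for S_1..S_k.
    """
    k = len(pattern_features)
    candidates = []
    for m in range(k + 1):  # the first m relations are 'c', the remaining ones 'b'
        if any(f == new_feature for f in pattern_features[:m]):
            break  # criterion 1: intervals of the same feature cannot co-occur
        candidates.append(["c"] * m + ["b"] * (k - m))
    return candidates


# A new state on an unused feature yields three candidates for a 2-pattern.
print(candidate_relations("f3", ["f1", "f2"]))
```

Criterion 2 is what restricts each vector to this "co-occurs, then before" shape, which keeps the number of candidates linear in k rather than exponential.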


In the Counting phase, candidate (k+1)-patterns are
removed if they do not meet the minimum support threshold,
i.e., if they do not occur at least $\sigma$ times as RTPs
in $Z_1$. The same procedure is carried out for $Z_0$.
Finally, we combine all the frequent RTPs into a final set
$\Omega$ of RTPs.


Binary Matrix Transformation: We transform each MSS
$z_i \in Z$ into a binary vector $v_i$ of size $|\Omega|$,
in which each 0 or 1 indicates whether the pattern
$P_j \in \Omega$ is a recent temporal pattern in $z_i$ or
not. This results in a binary matrix of size
$N \times |\Omega|$, which represents our original dataset.
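
The transformation itself is a simple indicator matrix. In the sketch below, `is_rtp` is a caller-supplied predicate standing in for the RTP matching test; it is hypothetical, and the set-containment predicate in the usage lines is only a toy stand-in, not the actual RTP test.

```python
def binary_matrix(sequences, patterns, is_rtp):
    """Build an N x |Omega| 0/1 matrix.

    Entry (i, j) is 1 if pattern j is an RTP in sequence i according to the
    caller-supplied predicate `is_rtp(pattern, sequence)`.
    """
    return [[1 if is_rtp(p, z) else 0 for p in patterns] for z in sequences]


# Toy usage: sequences and patterns as sets, "matching" as set containment.
toy_sequences = [{"a", "b"}, {"b"}, set()]
toy_patterns = [{"a"}, {"b"}]
matrix = binary_matrix(toy_sequences, toy_patterns, lambda p, z: p <= z)
```

Each row can then be used as the feature vector of one student in a standard classifier.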


2.4 Bayesian Knowledge Tracing 

BKT is a student modeling method extensively used in ITSs. 
Figure 4 shows a graphical representation of the model and a 
possible sequence of student observations. The shaded nodes 
S represent hidden knowledge states. The unshaded nodes 
O represent observation of students’ behaviors. The edges 
between the nodes represent their conditional dependence. 




Figure 4: The Bayesian network topology of the 
standard Knowledge Tracing model 


Fundamentally, the BKT model is a two-state Hidden Markov
Model [11, HMM] characterized by five basic elements: 1)
N, the number of different types of hidden state; 2) M, the
number of different types of observation; 3) $\Pi$, the
initial state distribution $P(S_0)$; 4) T, the state
transition probability $P(S_{t+1} | S_t)$; and 5) E, the
emission probability $P(O_t | S_t)$. Note that both N and M
are predefined before training occurs, while $\Pi$, T, and E
are learned from the students’ observation sequences.


Conventional BKT assumes there are two types of hidden
knowledge states (N=2), corresponding to the student
knowledge states unlearned and learned. It also assumes
there are two types of student observation (M=2),
corresponding to the student performance outcomes incorrect
and correct. BKT makes two assumptions about conditional
dependence, as reflected in the edges in Figure 4. The
first is that a student’s knowledge state at time t is
contingent only on her knowledge state at time t-1. The
second is that a student’s performance at time t depends
only on her current knowledge state. These two assumptions
are captured by the state transition probability T and the
emission probability E. In the context of student learning,
BKT further defines five parameters:


Prior Knowledge = P(S_0 = learned)
Learning Rate = P(learned | unlearned)
Forget = P(unlearned | learned)
Guess = P(correct | unlearned)
Slip = P(incorrect | learned)


To apply BKT to our dataset, we captured and mapped all
students’ actions, step by step, to the learning
opportunities of knowledge components (KCs). For each KC,
the Baum-Welch algorithm (an EM method) is used to
iteratively update the model’s parameters until the
probability of observing the training sequences converges
to a maximum.
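
Once the five parameters have been fitted, tracing a student's knowledge is a short recursive update. The sketch below shows only this inference step for a single KC, not the Baum-Welch fitting; the function name and the parameter values in the usage lines are illustrative assumptions.

```python
def bkt_trace(observations, prior, learn, forget, guess, slip):
    """Return P(correct) predicted before each observation (1 = correct, 0 = incorrect)."""
    p_learned = prior
    predictions = []
    for obs in observations:
        # Predicted probability of a correct response from the current knowledge state.
        p_correct = p_learned * (1 - slip) + (1 - p_learned) * guess
        predictions.append(p_correct)
        # Bayesian update of the knowledge state given the observation (emission E).
        if obs == 1:
            posterior = p_learned * (1 - slip) / p_correct
        else:
            posterior = p_learned * slip / (1 - p_correct)
        # Transition T to the next learning opportunity.
        p_learned = posterior * (1 - forget) + (1 - posterior) * learn
    return predictions


# Illustrative parameter values only; fitted values would come from Baum-Welch.
preds = bkt_trace([1, 0, 1], prior=0.5, learn=0.2, forget=0.0, guess=0.2, slip=0.1)
```

The sketch assumes guess and slip lie strictly between 0 and 1, so that neither denominator is zero.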


3. EXPERIMENTS 


In this work, we explored different student modeling tasks
based on the characteristics of two different learning
environments. One was the early prediction of student
success in an open-ended, self-paced programming
environment, while the other was the early prediction of
student learning gains within a tutor-paced probability
tutor.


3.1 Predicting Student Success on iSnap 
3.1.1 iSnap 


iSnap¹ is an extension of Snap! [15], a block-based
programming environment, used in an introductory computing
course for non-majors at a public university in the United
States [37]. iSnap extends Snap! by providing students with
data-driven hints derived from historical correct student
solutions [36]. In addition, iSnap logs all student actions
while programming (e.g., adding or deleting a block) as a
trace, allowing us to detect the sequence of all student
steps as well as the time taken for each step. In this work,
we focused on one homework exercise named Squiral, derived
from the BJC curriculum [15]. In Squiral, students are asked
to write a procedure that draws a square-like spiral. As
shown in Figure 5, correct solutions require procedures,
loops, and variables, using at least 7 lines of code. We
collected students’ data for Squiral from Spring 2016, Fall
2016, Spring 2017, and Fall 2017. We excluded students who
requested hints from iSnap to eliminate factors that might
affect students’ problem-solving progress, leaving a total
of 65, 38, 29, and 39 student code traces from each
semester, respectively. The detailed statistics for the
iSnap dataset are shown in Table 1.


¹ All tutor and assignment names have been blinded for
anonymous review.


The data collected from iSnap consist of a code trace for
each student’s attempt. This code trace represents a
sequence of timestamped snapshots of student code. We used
an expert feature detector (EFD), described in [49], that
automatically detects 7 features of a correct solution in a
student snapshot. For each snapshot in a student code
trace, the EFD outputs a feature state, which is a series
of 0s and 1s indicating the absence or presence of each
feature; for example, the feature state 1000001 shows that
feature 1 and feature 7 are present, while the other 5
features are not. We ran the EFD to tag each snapshot in
all 171 code traces, making a total of 31,064 tagged
snapshots.
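
Reading a feature state is straightforward, as the following sketch shows. The helper name `present_features` is hypothetical and is not part of the EFD described in [49].

```python
def present_features(feature_state):
    """Return the 1-based indices of the features marked present in a 0/1 string."""
    if set(feature_state) - {"0", "1"}:
        raise ValueError("feature state must contain only 0s and 1s")
    return [i for i, bit in enumerate(feature_state, start=1) if bit == "1"]


print(present_features("1000001"))  # -> [1, 7]
```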




Figure 5: The iSnap interface, with the blocks 
palette on the left, the output stage on the right, 
the scripting area in the middle, and the hints but- 
ton on top. 


3.1.2 Student Success 

In the context of iSnap, all the models were evaluated on
the task of predicting student success. We classify
students who finished the programming assignment in one
hour or less and received full credit as successful, labeled
“0”, and those who failed to either complete or submit the
assignment within one hour as unsuccessful, labeled “1”.
The one-hour cutoff was chosen based on a distribution
showing that the vast majority of students (around 94%) who
complete the assignment with full credit do so within one
hour. Thus, each trajectory is assigned one ground truth
label based on whether the student finished the assignment
successfully or unsuccessfully. We refer to this task as
the early prediction of student success. Based on this
definition, 59 of 171 students are in the successful group,
and the remaining 112 are in the unsuccessful group. Note
that this is a homework assignment that counts for only a
small portion of a student’s overall grade, and not
attempting to obtain full credit is typical at this
introductory level.


To predict student success, we are given up to the first n
minutes of a student’s sequence data, and our goal is to
predict whether the student will successfully complete the
programming assignment at any point in the remainder of
the sequence. To conduct this task, we left-aligned all the
students’ trajectories by their starting times; our
observation window (the part of the data used to train and
test the different machine learning models) includes the
sequences from the very beginning to the first n minutes.
If a student’s trajectory is shorter than n minutes, our
observation window includes their entire sequence except
its last element.
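
The observation window can be sketched as a simple truncation of a left-aligned trajectory. The event representation (pairs of seconds since the start and a payload), the inclusive treatment of the cutoff, and the function name are assumptions made for this illustration.

```python
def observation_window(events, n_minutes):
    """Keep the events that fall within the first n minutes of a trajectory.

    `events` is a list of (seconds_since_start, payload) pairs sorted by time.
    If the whole trajectory is shorter than n minutes, its last event is dropped.
    """
    cutoff = n_minutes * 60
    if events and events[-1][0] < cutoff:
        return events[:-1]
    return [ev for ev in events if ev[0] <= cutoff]


trajectory = [(0, "a"), (100, "b"), (400, "c"), (700, "d")]
early = observation_window(trajectory, 6)   # events up to 360 seconds
```

Dropping the last event of a short trajectory keeps the model from seeing the final state whose outcome it is asked to predict.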


3.1.3 Four Models 

In the task of early prediction of student success, we
compare four models: Logistic Regression (LR), RTP, LSTM,
and T-LSTM. Note that BKT is not included here because, in
an open-ended domain like iSnap, there are no pre-defined
steps or knowledge components that students must achieve
to complete a given program. Thus, it is hard to map
student actions in iSnap to the learning opportunities
defined in BKT.


Logistic Regression (LR): Since LR does not handle
sequence data directly, we used a “Last Value” approach,
which takes the last measurement of each attribute within
the given observation window as the input for training. For
the early prediction setting, we truncated all the
sequences in the training dataset in the same fashion as in
the testing dataset and then applied the Last Value
approach to the truncated training dataset. For example,
when our observation window is 6 minutes, we take the last
value before 6 minutes in each sequence and treat these
values as the inputs to LR.
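
The Last Value approach can be sketched as follows. The representation of a truncated sequence as a time-ordered list of dictionaries, the attribute names, and the use of `None` for attributes never observed in the window are all assumptions for illustration.

```python
def last_value_features(events, attributes):
    """Last Value approach: keep the latest measurement of each attribute.

    `events` is a time-ordered list of dicts mapping attribute name -> value.
    Attributes never observed in the window are returned as None.
    """
    latest = {name: None for name in attributes}
    for measurement in events:
        for name, value in measurement.items():
            if name in latest:
                latest[name] = value
    return [latest[name] for name in attributes]


window = [{"f1": 0, "f2": 0}, {"f1": 1}, {"f2": 0, "f3": 1}]
vector = last_value_features(window, ["f1", "f2", "f3", "f4"])
```

The resulting fixed-length vector is what a non-temporal model such as LR receives in place of the full sequence.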


RTP: For the RTP-based model, we first used RTP mining
to generate the binary matrix and then applied LR to learn
from it. For early prediction, we used only the truncated
training sequences within the observation window to find
RTPs. For example, for our 6-minute observation window,
only the first 6 minutes of the sequences were used for
pattern extraction.


LSTM and T-LSTM: For LSTM, the input is a multivariate
temporal sequence from student work, and the output of the
last step is used to make a prediction. For T-LSTM, we
additionally feed in a sequence of time intervals for each
student. As shown in Table 1, the time intervals in iSnap
range from 1 to 291 seconds across the four semesters, with
μ = 0.613 and σ = 0.217 for the overall decayed intervals.
For both LSTM and T-LSTM, we used one hidden layer with 128
hidden neurons and set the maximum length to accommodate
the longest sequence in our data. Typically for deep
learning models, the whole multivariate time series from
student sequence data is used as input. However, for early
prediction, only the events happening within our
observation window were used from each sequence.


3.2 Predicting Learning Gains on Pyrenees 


3.2.1 Pyrenees 

Pyrenees is a web-based ITS for teaching probability that
covers 10 major knowledge components (KCs), such as the
Addition Theorem, the Complement Theorem, and Bayes’ Rule.
Domain experts identified the 10 KCs and labeled each
step/exercise with the corresponding KCs (kappa > 0.9).
Figure 6 shows the interface of Pyrenees, which consists of
a problem statement window, a variable window, an equation
window, and a tutor-student dialogue window.
Through the dialogue window, Pyrenees provides messages 
to the students. It can explain a worked example or prompt 
Table 1: Detailed data statistics for iSnap, including total steps, total time spent in minutes, time intervals
in seconds, corresponding decayed time intervals, and the success label distribution (S: successful,
U: unsuccessful) for each of the four semesters.

         | Total Steps                   | Total Time (minutes)                         | Time Intervals (seconds)              | Decayed Time Intervals | Success Labels
Semester | min  max   median  mean (std) | min    max      median  mean (std)           | min  max  median  mean (std)          | mean (std)             | S   U
S16      | 10   1024  169     199 (175)  | 0.533  95.667   20.733  22.777 (17.149)      | 1    209  2       6.739 (13.75)       | 0.628 (0.217)          | 23  42
F16      | 28   884   121     167 (168)  | 3.283  119.083  16.325  22.379 ([illegible]) | 1    189  3       [illegible] (14.12) | 0.594 (0.217)          | 15  23
S17      | 15   439   75      112 (94)   | 2.817  62.983   14.167  16.347 (11.872)      | 1    177  3       [illegible] (16.14) | 0.599 (0.225)          | 12  17
F17      | 10   2276  100     219 (376)  | 1.65   189.667  19.1    28.224 (33.869)      | 1    291  3       [illegible] (15.61) | 0.609 (0.215)          | 9   30


the student to complete the next step. Students can en- 
ter their inputs in the text area. Any variable or equa- 
tion that is defined through this process is displayed on the 
left side of the screen for reference. Pyrenees can also
provide on-demand hints. The bottom-out hint tells the
student exactly how to solve a problem. Unlike iSnap, the
Pyrenees tutor provides immediate correct/incorrect
feedback whenever an answer is submitted.




Figure 6: The Pyrenees interface, with the problem 
statement on the top, the variable window in the 
middle, the equation window at the bottom, and 
the dialog window on the right. 


When training on Pyrenees, students were required to com- 
plete 4 phases: 1) pre-training, 2) pretest, 3) training, and 
4) post-test. During the pre-training phase, all students 
studied the domain principles from a probability textbook. 
The students then took a pretest which contained 10 prob- 
lems. The textbook was not available. Students were not 
given feedback on their answers, nor were they allowed to 
go back to earlier questions. During the training phase, stu- 
dents received the same 12 training problems in the same 
order on Pyrenees. Each domain concept was applied at 
least twice. The minimum number of steps needed to solve 
each training problem ranged from 10 to 50. The number 
of domain principles required to solve each problem ranged 
from 3 to 10. Finally, all of the students took a post-test 
with 20 problems. Both pretests and post-tests were graded 
in a double-blind manner by a single experienced grader (not 
the authors), and were normalized in the range of [0,1]. We 
collected six semesters of data from Pyrenees, including Fall 
2016, Spring 2017, Fall 2017, Spring 2018, Fall 2018, and 
Spring 2019. The overall dataset comprises 102,948 data 
points from 1190 students, with 207, 159, 215, 161, 261 and 
187 from each semester, respectively. The detailed statistics
for the Pyrenees dataset are shown in Table 2.


3.2.2 Quantized Learning Gain

In the context of Pyrenees, we applied all the models to
the prediction of student learning gains. Learning gain is
formally defined as the difference between the skills,
competencies, content knowledge, and personal development
demonstrated by students at two points in time [28]. We
used a qualitative measurement called Quantized Learning
Gain [24, QLG] to determine whether a student has
benefited from our learning environment. QLG is a binary
qualitative measurement of students’ learning gains from
pre-test to post-test: high vs. low. To infer QLGs,
students were split into “low”, “medium”, and “high”
performance groups based on whether they scored below the
33rd percentile, between the 33rd and 66th percentiles, or
above the 66th percentile on the pre-test and post-test,
respectively. Once a student’s pre- and post-test
performance groups are decided, the student is assigned a
“high” QLG if he/she moved from a lower performance group
to a higher performance group from pre-test to post-test or
remained in the “high” performance group, whereas a “low”
QLG is assigned if he/she either moved from a higher
performance group to a lower one or stayed in the “low” or
“medium” group (as shown in Figure 7). In Figure 7, solid
lines represent the formation of the high QLG group and
dashed lines represent the formation of the low QLG group;
the two groups are coded “1” and “0”, respectively, for QLG
prediction. As a result, we have 487 of 1190 students in
the high learning gain group and the remaining 703 students
in the low learning gain group.


Figure 7: Quantized Learning Gain


Students usually need 2-4 hours to complete the Pyrenees
tutor. For early prediction, we are given only the first n
percent of a student’s sequence data, and our goal is to
predict the student’s QLG, i.e., whether the student will
benefit from our tutoring system in the end. As with the
success prediction in iSnap, we left-aligned all students’
trajectories by their starting times, so that the observation
window covers the data from the very beginning up to the
first n percent of the whole sequence.
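
As an illustration of the observation window, the following sketch truncates a left-aligned sequence to its first n percent. The function name `observation_window` is hypothetical, and measuring the window in number of steps and rounding up are assumptions made for this example rather than details stated in the text.

```python
import math


def observation_window(sequence, n_percent):
    """Return the first n_percent of a left-aligned student sequence.

    Assumption: the window is measured in number of steps and rounded up,
    so any positive percentage keeps at least one step of a non-empty
    sequence.
    """
    if not 0 <= n_percent <= 100:
        raise ValueError("n_percent must be between 0 and 100")
    keep = math.ceil(len(sequence) * n_percent / 100)
    return sequence[:keep]


# Example: a 10-step trajectory observed at the first 30%.
steps = list(range(10))
early = observation_window(steps, 30)
```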
Table 2: Detailed data statistics for Pyrenees, including total steps, total time spent in hours, time intervals in
seconds, corresponding decayed time intervals, and the QLG label distribution for each of the six semesters.

Semester | Total Steps                  | Total Time (hours)                     | Time Intervals (seconds)                 | Decayed Time Intervals | QLG Labels
         | min  max  median  mean (std) | min    max      median  mean (std)     | min  max      median  mean (std)         | mean (std)             | low   high
F16      | 12   144  78      75 (25)    | 0.545  173.553  4.142   15.039 (25.56) | 1    542136   31      731.799 (10876.82) | 0.298 (0.11)           | 80    127
S17      | 59   152  88      94 (23)    | 0.642  240.661  2.643   17.492 (38.29) | 1    861636   25      685.094 (14329.39) | 0.314 (0.11)           | 59    100
F17      | 38   148  113     105 (25)   | 0.773  576.055  5.100   24.335 (64.29) | 1    1287547  24      844.083 (20941.73) | 0.313 (0.11)           | 105   110
S18      | 23   138  73      71 (21)    | 0.587  135.597  2.682   9.431 (18.33)  | 1    354272   27      486.703 (7021.54)  | 0.307 (0.10)           | 47    114
F18      | 26   162  86      88 (23)    | 0.679  165.559  4.024   14.914 (22.54) | 1    438986   28      613.861 (8924.83)  | 0.301 (0.10)           | 98    163
S19      | 12   138  81      83 (21)    | 0.571  170.116  4.613   16.909 (27.56) | 1    609641   28      738.505 (11439.02) | 0.305 (0.11)           | 98    89


3.2.3 Four Models 

For the task of early prediction of student QLG, we compare
four models: LR, BKT, LSTM, and T-LSTM. Note that we do not
include RTP here because, in Pyrenees, students’ responses
are determined not only by their underlying knowledge state
but also by the pre-designed turn-taking nature of the
system, which could obscure the temporal patterns found by
RTP.


Logistic Regression (LR): As with student success
prediction, the “Last Value” approach was applied to the
non-temporal LR for predicting student learning gains,
including in the early prediction setting. For example, when
the observation window is the first 30% of each sequence,
only the last value before the 30% mark of each sequence was
used for both training and testing.


BKT: Training the BKT model for QLG prediction involved two
steps. In the first step, the probability of a student being
in the learned state on each KC at the last attempt was
estimated by the BKT model. In the second step, the outputs
of the first step were used as features for our prediction
task; that is, the number of features equals the total
number of KCs. Logistic regression was then applied to these
features to predict QLG. As in the early prediction setting
for student success, only the truncated training sequences
were used to estimate the learning probabilities from BKT.
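
The first of these two steps can be sketched with the standard BKT update equations. The sketch below assumes that the four BKT parameters (initial knowledge, transition, slip, and guess) have already been fitted for each KC; the function names, the `(kc, correct)` step representation, and the parameter dictionary layout are hypothetical and are not taken from our implementation.

```python
def bkt_last_mastery(responses, p_init, p_transit, p_slip, p_guess):
    """Probability of the learned state after the last attempt on one KC.

    `responses` is a list of 1 (correct) / 0 (incorrect) attempts.
    Parameters are assumed to lie strictly between 0 and 1.
    """
    p_learned = p_init
    for correct in responses:
        if correct:
            # Posterior of being learned given a correct response.
            num = p_learned * (1 - p_slip)
            den = num + (1 - p_learned) * p_guess
        else:
            # Posterior of being learned given an incorrect response.
            num = p_learned * p_slip
            den = num + (1 - p_learned) * (1 - p_guess)
        posterior = num / den
        # Account for the chance of learning at this opportunity.
        p_learned = posterior + (1 - posterior) * p_transit
    return p_learned


def bkt_feature_vector(steps, kc_params, kcs):
    """One mastery estimate per KC, in the order given by `kcs`.

    `steps` is a (possibly truncated) list of (kc, correct) pairs and
    `kc_params` maps each KC to its fitted BKT parameters.
    """
    by_kc = {kc: [] for kc in kcs}
    for kc, correct in steps:
        by_kc[kc].append(correct)
    return [bkt_last_mastery(by_kc[kc], **kc_params[kc]) for kc in kcs]
```

The resulting vector has one entry per KC and would then be passed to a logistic regression classifier; a KC with no attempts inside the window simply keeps its initial probability.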


LSTM and T-LSTM: To better compare LSTM and T-LSTM with BKT,
the same two types of features were used here for QLG
prediction: 1) the KC assigned to each step, and 2) student
performance at each step, i.e., correct or incorrect. As
shown in Table 2, the time intervals in Pyrenees range from
1 second to 14 days across the six semesters, with μ = 0.307
and σ = 0.107 for the overall decayed intervals. For both
LSTM and T-LSTM, we used one hidden layer with 64 hidden
neurons and set the maximum length to accommodate the
longest sequence in our data. Again, only those events
happening within our observation window from each sequence
were used for training and testing in early prediction.
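
To make the notion of a decayed time interval concrete, the sketch below computes decayed intervals from event timestamps. It assumes the logarithmic decay g(Δt) = 1 / log(e + Δt) that is commonly used with T-LSTM; this specific form, the use of seconds as the time unit, and the function names are assumptions of the example and should be checked against the model definition given earlier.

```python
import math


def decay(delta_t):
    """Assumed T-LSTM decay g(dt) = 1 / log(e + dt), with dt in seconds."""
    if delta_t < 0:
        raise ValueError("elapsed time must be non-negative")
    return 1.0 / math.log(math.e + delta_t)


def decayed_intervals(timestamps):
    """Decayed elapsed time between each pair of consecutive events."""
    return [decay(b - a) for a, b in zip(timestamps, timestamps[1:])]


# Example: three events at 0 s, 31 s and one day later.
weights = decayed_intervals([0, 31, 31 + 86400])
```

Under this assumed decay, an interval of 31 seconds maps to a weight of roughly 0.28, and longer gaps map to smaller weights, so the contribution of the previous memory shrinks as more time elapses.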


3.3 Evaluation Metrics

Our models were evaluated using Accuracy, Precision, Recall,
F1 Score, and AUC (Area Under the ROC Curve). Accuracy is
the proportion of students whose labels were correctly
identified. Precision is the proportion of students
predicted to be successful by a model who were actually in
the successful (or high QLG) group. Recall is the proportion
of students who will actually be unsuccessful (or in the low
QLG group) who were correctly recognized by the model. F1
Score is the harmonic mean of Precision and Recall and
balances the trade-off between them. AUC measures the
ability of a model to discriminate between groups with
different labels. Given the nature of the tasks, we mainly
use Accuracy and AUC to compare different models. Finally,
it is important to emphasize that all models were evaluated
using semester-based temporal cross-validation for both
tasks, which uses only data from previous semesters for
training and is a much stricter approach for time series
data than standard cross-validation.
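
The semester-based temporal cross-validation can be sketched as a simple split generator. This is an illustrative sketch: the function name `temporal_cv_splits` is hypothetical, and the choice of an expanding window (training on all earlier semesters rather than only the immediately preceding one) is an assumption of the example.

```python
def temporal_cv_splits(semesters):
    """Yield (train_semesters, test_semester) pairs in chronological order.

    Assumption: every semester except the first serves as the test fold
    once, and all earlier semesters form its training set.
    """
    for i in range(1, len(semesters)):
        yield semesters[:i], semesters[i]


# Example with the six Pyrenees semesters listed in Table 2.
folds = list(temporal_cv_splits(["F16", "S17", "F17", "S18", "F18", "S19"]))
```

Because a test semester never precedes any of its training semesters, no information from the future leaks into training, which is what makes this protocol stricter than shuffling students across folds.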


4. RESULTS 


4.1 Predicting Student Success in iSnap 

Table 3 shows the performance of all models using the
first-6-minute training sequences to predict students’
success in the programming task. The first column lists the
models: a majority baseline using a simple Majority vote,
Logistic Regression (LR), RTP, LSTM, and T-LSTM. Columns 2-6
report the models’ performance for the first-6-minute
observation window on Accuracy, Precision, Recall, F1, and
AUC; note that Precision, Recall, and F1-measure are not
reported for the simple Majority baseline. The last column
reports the mean AUC score of each model over the 0 - 20
minute observation windows, with standard deviations in
parentheses. At the first 6 minutes, T-LSTM outperforms all
the other models, achieving the highest score on every
measurement except Recall, where RTP is best. LSTM and RTP
have very similar performance at the first 6 minutes, and
both perform better than LR except on Precision and AUC.
When comparing the overall AUC score among all the models,
T-LSTM still achieves the highest score. These results
suggest that T-LSTM can better learn the difference between
the successful and unsuccessful groups with the help of
time-awareness.


Figures 8 (a) and (b) report Accuracy and AUC, respectively,
for all models predicting student success. For each graph,
we vary the observation window from the first 2 minutes up
to 20 minutes. As shown in Table 1, students generally take
10 to 60 minutes to complete the task, and thus we took a
measurement every 2 minutes for the first 10 minutes to
generate the early-stage predictions for each model. T-LSTM
is in red, LSTM in blue, RTP in purple, LR in green, and the
majority baseline in black. Both Figures 8 (a) and (b) show
that T-LSTM was the best model for student programming
success prediction, as it stays on top across all sizes of
the observation window. It is not surprising that, for all
the models except the majority baseline, longer observation
windows generally lead to better performance: the training
data includes more and more information, and students get
closer to their final state. The fact that the best
prediction comes from T-LSTM suggests that, during the
self-paced programming task, taking time-awareness into
consideration brings us closer to the truth of the student
learning process, especially in the early stage (first 10
minutes). However, this is only one observation from one
programming task, and more research is needed to understand
the full nature of the benefits of time-awareness.


Table 3: iSnap Student Success Prediction at First-6-minute and Overall Time (0 - 20 minutes)

Models   | first-6-minute                                    | Overall
         | Accuracy  Precision  Recall   F1-measure  AUC     | AUC
Majority | 0.6604    -          -        -           0.5000  | 0.5000
LR       | 0.6038    0.8333     0.5000   0.6250      0.6528  | 0.7123 (±0.08)
RTP      | 0.6792    0.7195     0.8429*  0.7763      0.6020  | 0.6948 (±0.09)
LSTM     | 0.6792    0.7368     0.8000   0.7671      0.6222  | 0.6755 (±0.09)
T-LSTM   | 0.7358*   0.875*     0.7000   0.7778*     0.7528* | 0.7512 (±0.07)*

Note: best model on each metric marked with *


4.2 Predicting Learning Gains in Pyrenees

Table 4 shows the performance of all models using the
first-30%-sequence to predict students’ QLG on the
probability tutor. The first column lists the models: a
majority baseline using a simple Majority vote, LR, BKT,
LSTM, and T-LSTM. Columns 2-6 report the models’ performance
at the first-30%-sequence observation window. As with Table
3, we evaluated the models on Accuracy, Precision, Recall,
F1, and AUC, and do not report Precision, Recall, and
F1-measure for the simple Majority baseline. The last column
reports the mean AUC score of each model over 0 - 100% of
the sequence, with standard deviations in parentheses. When
only the first 30% of the sequence is used, T-LSTM achieves
the best performance on every measurement except Recall and
F1, where the best Recall comes from LR and the best F1 from
LSTM. Comparing the two deep learning models with classic
BKT, both LSTM and T-LSTM outperform BKT across all metrics.
For the overall AUC, LSTM and T-LSTM have very similar
scores, and both achieve higher mean AUC scores than BKT
with a lower standard deviation. Despite the similar overall
performance of the two deep learning models, the better
early prediction of T-LSTM suggests that time-awareness can
help to understand student learning states earlier.


The early prediction results for student learning gains in
probability are reported in Figure 9. BKT is in purple and,
as in Figure 8, T-LSTM, LSTM, and LR are in red, blue, and
green, respectively. For each graph, the results are
measured at every 10% increment of the sequence length.
Generally speaking, the three models (BKT, LSTM, and T-LSTM)
generate better results as the sequence length increases.
Both Figures 9 (a) and (b) show that the two deep learning
models outperform BKT for probability on both Accuracy and
AUC. Between LSTM and T-LSTM, however, there is no clear
winner: sometimes T-LSTM performs better on Accuracy (from
10% to 30%), while sometimes LSTM slightly outperforms
T-LSTM (from 40% to 70%). Overall, LSTM and T-LSTM generate
very similar results on predicting student QLG, and T-LSTM
generally performs better at the very early stage.


Figure 8: Student Success Early Prediction on iSnap

Table 4: Pyrenees Student QLG Prediction at First-30%-sequence and Overall Time (0 - 100%)

Models   | first-30%-sequence                                | Overall
         | Accuracy  Precision  Recall   F1-measure  AUC     | AUC
Majority | 0.5860    -          -        -           0.5000  | 0.5000
LR       | 0.5839    0.5893     0.9566*  0.7293      0.5066  | 0.4957 (±0.01)
BKT      | 0.6022    0.6113     0.8819   0.7221      0.5442  | 0.5690 (±0.03)
LSTM     | 0.6226    0.6188     0.9271   0.7422*     0.5594  | 0.6013 (±0.02)*
T-LSTM   | 0.6328*   0.6322*    0.8924   0.7401      0.5789* | 0.5950 (±0.02)

Note: best model on each metric marked with *

Figure 9: Student Learning Gain Early Prediction on Pyrenees


5. RELATED WORK 


Student modeling has been widely and extensively explored 
in previous research. For example, prior research has pro- 
posed a series of approaches based on logistic regression in- 
cluding Item Response Theory (IRT) [42], Learning Factor 
Analysis [5], Learning Decomposition [4], Instructional Fac- 
tors Analysis [7], Performance Factors Analysis [33], and 
Recent-Performance Factors Analysis [14]. These models 
were implemented with different parameters to better un- 
derstand and model student learning and were shown to be 
very successful. 


BKT [10] is one of the most widely investigated student 
modeling approaches. It models a student’s performance in 
solving problems related to a given concept using a binary 
variable (i.e., correct, incorrect) and continually updates its 
estimation of the student’s learning state for that concept. 
Many extensions of BKT have been proposed to capture the 
complex and diverse aspects of student learning. Pardos 
and Heffernan [31] explored individualized prior knowledge 
parameters based on students’ overall competence. Their 
results showed that the proposed model outperformed con- 
ventional BKT in predicting students’ responses to the last 
question at the end of the entire training. They later in- 
troduced problem difficulty to BKT and found substantial 
performance improvement in predicting student step-by-step 
responses over BKT [32]. Additionally, Yudelson et al. [48]
parameterized student learning rates in BKT models, and the
results showed that the new model outperformed conventional
BKT in predicting whether the students’ next responses were
going to be correct or incorrect. Baker et al. [1]
investigated contextualized guess and slip rates to deal
with the issues of identifiability and model degeneracy
commonly observed in conventional BKT. Their results
suggested that the proposed models achieved better
performance in predicting students’ next-step responses than
BKT. However, in this study, BKT-based models cannot be
directly applied to our open-ended programming task, because
of the difficulty of mapping students’ time-varying actions
onto a step-by-step sequence.


In recent years, extensive research has been conducted on
deep learning models, especially Recurrent Neural Networks
(RNN) and RNN-based models such as LSTM. These deep
recurrent models have shown great success in many domains,
such as speech recognition [17], language translation [26],
video classification [29], and rainfall intensity prediction
[46]. Their success in these domains has opened up a new
line of research in educational data mining [35, 41, 22, 45,
47, 24, 30]. Mao et al. [27] have shown that LSTM has
superior performance on the early prediction of student
learning gains compared with classic BKT-based models. For
the task of predicting students’ responses to exercises,
LSTM was shown to outperform conventional BKT [35] and
Performance Factors Analysis [33]. However, RNN and LSTM did
not always perform better when the simple, conventional
models incorporated other parameters. For example, Khajah et
al. [22] investigated what statistical regularities neural
networks can exploit that BKT cannot, and showed that BKT
with relaxed assumptions can outperform LSTM. Wilson et al.
[45] also showed that Bayesian extensions of simple
IRT-based models perform as well as or better than RNN-based
models on a variety of datasets.


While most of the previous studies on student modeling fo- 
cus on predicting students’ success and failure in the next- 
step attempt, some research has used student-tutor inter- 
action data to predict student post-test scores [13, 39]. In 
this work, we explored the early prediction of student suc- 
cess and learning gains for a computer-based programming 
system and an intelligent tutoring system, respectively. 


6. CONCLUSIONS 


Early prediction of student learning state is a crucial compo- 
nent of student modeling, since it allows tutoring systems to 
intervene by providing needed support, such as a hint, or by 
alerting an instructor. Both prediction tasks involved in this 
work are challenging because: 1) the open-ended nature of 
iSnap hinders the prediction of student final success, and 2) 
it is extremely hard to track whether a student benefits from 
a tutoring system or not even in a well-defined domain like 
Pyrenees. In this work, we investigated the effectiveness of
a time-aware model, T-LSTM, on the two prediction tasks and
compared it with other student modeling methods, including
LSTM, RTP, logistic regression, and BKT. Our results show
that T-LSTM consistently outperforms the other models (LSTM,
RTP, and non-temporal logistic regression) on the task of
predicting student success in iSnap, at all observation
windows from the first 2 minutes to 20 minutes. On the other
hand, for the task of predicting student learning gains in
Pyrenees, T-LSTM does not consistently outperform the other
models. More specifically, T-LSTM outperforms LSTM and BKT
in the early stage, with only the first 30% of the student
sequences, but afterward time-awareness does not help much
when more data is available. One possible explanation is
that in a well-defined domain, the whole learning process is
mainly driven by the tutor, which makes the elapsed time
less important to student learning gains, especially when
step-level performance is available. In the open-ended
programming environment, however, students are
self-prompted to complete an assignment, and therefore the
amount of time they stay in a state really matters for
understanding their learning. As a result, T-LSTM, which
models the dynamics of student knowledge in continuous time,
can generate better performance than methods that operate on
discrete timesteps.


One limitation of this work is that we only explored one
important student modeling task in each learning
environment. An important direction for future work is to
investigate the time-aware model on other student modeling
tasks in both learning environments to determine whether the
same results hold. In addition, we plan to incorporate
time-awareness into other models, such as RTP, to explore
whether it continues to yield improvements in the open-ended
programming environment. We also plan to apply this work to
larger groups of students and longer programming tasks, and
to integrate more informative features, such as intervention
and demographic features, to develop more robust models.
Finally, we plan to expand our evaluations to longer
programs with more complex constructs from both text-based
and block-based programming languages.